Coding for DS and DM
R coding module

Lecture 1

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

Useful Information: Instructor

  • Name: Andrea Cappozzo
  • Department: Department of Economics, Management and Quantitative Methods
  • Office: Room 29, Department of Economics, Management and Quantitative Methods, Via Conservatorio 7, 20122 Milano MI
  • Email: andrea.cappozzo@unimi.it
  • Office Hours: Please send an email to arrange a meeting.

Useful Information: teaching assistant

Useful Information: class schedule

  • Wednesday, 12:30PM-2:30PM, Classroom 304 (Via Celoria)
  • TBD lab session (online)

Any changes to the schedule will be communicated by the instructor through the Ariel portal.

Useful Information: Webpage and Emails

  • Ariel Portal: A section dedicated to the course is available where you can find announcements, study support materials, R scripts, and more.

  • Course Website: https://myariel.unimi.it/course/view.php?id=2950

  • Email Communication: All communication with instructors should be done through students’ institutional email.

    • Subject Line: Use “[RCODING2024]-” as the subject line for all emails related to the course.

Module level and shape

  • Impossible to see everything!
  • R is continuously evolving, so trying to explain everything does not make sense.
  • We will cover the topics most useful for data science and statistics.
  • Warning 1: basic statistical knowledge required!
  • Warning 2: basic software development skills welcome but not necessary!
  • Warning 3: unlike the other modules, this one involves more statistics!

Course goal

To present R programming features for data science and statistical applications

  • coding examples.
  • data management examples.
  • statistical applications.

Module contents

  • Introduction to the R framework and RStudio
  • Basic data types, data structures and operations
  • Control structures and custom functions
  • Inference elements
  • Data acquisition
  • Basic data visualization
  • Quarto
  • Building R packages
  • Further topics… (well, maybe!)

Final exam: R coding

  • Written exam on the topics covered (both lectures and labs!)
  • 30 minutes of closed-form exercises
  • integrated with the other modules
  • for all the details refer to the Exam organization and Exam modality sections on the course webpage

Computer software for statisticians: R

R is an integrated set of software resources for data manipulation, computation, and graphical visualization.

Why should we use R?

  • Efficient data manipulation and storage;
  • A comprehensive, consistent, and integrated collection of intermediate tools for data analysis;
  • Graphical resources for data analysis;
  • Well-developed, straightforward programming language;
  • The community is very large and diverse (e.g., #rstats on Twitter), with numerous local groups (e.g., MilanoR) and conferences (e.g., useR);
  • Cutting-edge tools: typically, a new work/methodology is presented along with an R package.

Why shouldn’t we use R?

  • There are numerous peculiarities and exceptions that need to be remembered.
  • Even basic functions often have an inconsistent structure.
  • Typically, since it is used by statisticians rather than computer scientists, so-called best practices are often ignored.
  • An R script can be much slower than its equivalent in C++, Python, Julia, or similar languages.

R history

R is a dialect of S, a language developed by John Chambers and colleagues at Bell Labs starting in 1976. In 1988, S was completely rewritten in C. John Chambers once stated:

We wanted users to be able to start in an interactive environment where they weren’t aware they were programming. As their needs became clearer and their skill level increased, there should be a smooth transition toward programming, where the language and system aspects would become more relevant.

Brief History

  • 1991: Created in New Zealand by Ross Ihaka and Robert Gentleman.
  • 1993: Public announcement.
  • 1995: Adoption of the GNU General Public License (R becomes free software).
  • 2000: Release of version 1.0.0.
  • 2018: Release of version 3.5.2 (20/12/2018).
  • 2024: Current version 4.4.1.

Features

  • Works on all computing platforms and operating systems.
  • Very frequent releases (annual + bug-fixing releases); active development.
  • Fairly “clean” since it was made available.
  • Functionality is divided into modular packages.
  • Highly sophisticated graphical capabilities
  • Useful for working in interactive mode; provides a powerful programming language for developing new tools (e.g., Shiny, Plumber).
  • Very active user community.

Features

R is free and open source, which guarantees the following:

  • Level 0: The freedom to run the program for any purpose.
  • Level 1: The freedom to study how the program works and adapt it to your needs. Access to the source code is a prerequisite for this.
  • Level 2: The freedom to redistribute copies to help others.
  • Level 3: The freedom to improve the program and make those improvements publicly available so that the entire community benefits. Access to the source code is a prerequisite for this.

R System Design

The R system is conceptually divided into two parts:

  1. Base System: Downloadable from CRAN.
  2. The Rest of the World

Functionality is divided into packages (collections of functions, data, and compiled code in a well-defined format).
The base system contains the packages with essential functions.

Some of the packages included in the base system are: utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
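
As a rough illustration, the packages shipped with the base system can be listed from within R. This is a sketch that assumes a standard R installation; the name base_pkgs is only illustrative.

```r
# installed.packages() returns a matrix with one row per installed package;
# priority = "base" keeps only the packages of the base system.
base_pkgs <- rownames(installed.packages(priority = "base"))
base_pkgs

# Check that some of the packages named above are among them
c("utils", "stats", "graphics") %in% base_pkgs
```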

R Resources

How to Learn to Use R? By Playing with It

RStudio

RStudio is an Integrated Development Environment (IDE) for R programming. You can download and install it from http://www.rstudio.com/download. It is updated once or twice a year.

RStudio IDE

R as an Object-Oriented Language

To understand computations in R, two guiding principles are useful:

  1. Everything that exists is an object.
  2. Everything that happens is a function call.

— John Chambers
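
A small sketch of what the two principles mean in practice: operators and even assignment can be written as ordinary function calls, and functions are themselves objects. The object names used here (sum_infix, sum_prefix, assigned_value) are only illustrative.

```r
# The operator + is a function: both lines compute the same value
sum_infix  <- 1 + 2
sum_prefix <- `+`(1, 2)
identical(sum_infix, sum_prefix)   # TRUE

# Assignment is a function call too
`<-`(assigned_value, 10)
assigned_value                     # 10

# A function is an object like any other
is.function(sum)                   # TRUE
class(sum)                         # "function"
```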

Help

It’s essential to know how to use help functions in R:

  • help.start(): Opens the help system in HTML format.
  • help(mean): Provides help for a specific function (here, mean).
  • ?mean: An alternative way to get help for mean
  • help("for"): Retrieves help on reserved words and special characters (e.g., TRUE, FALSE, NA)
  • help.search("mean"): Searches for the string “mean” throughout the documentation.
  • ??mean: Another way to search for “mean” in the documentation.
  • ?help: Provides details on how to use the help system.

?print
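
Two further base R helpers, not listed above, can complement the help pages: args() shows the arguments of a function, and apropos() lists objects whose name matches a string. The sketch below is only one possible way to use them; the name found is illustrative.

```r
# Show the arguments of sd() without opening its help page
args(sd)

# List the objects whose name contains "mean"
found <- apropos("mean")
found
```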

Help wanted

If you find that even after reading the help documentation you still don’t understand how to use a particular function or command, here’s what you can do:

  • Check Online Resources:
    • Stack Overflow: A popular forum where you can ask questions and get answers from the community.
    • GitHub Issues: For issues related to specific packages or tools hosted on GitHub.

Help me help you

  • Create a Simple, Reproducible Example: When seeking help, it’s crucial to provide a clear and minimal example that demonstrates the problem. For further details, look at https://reprex.tidyverse.org

By doing this, you increase the likelihood of getting useful support and solutions from the community.

Workspace

  • getwd(): Retrieves the current working directory.
  • setwd(): Changes the working directory.
    • Example for Windows: setwd("c:/CorsoR")
  • ls(): Lists the objects present in the workspace.
  • rm(): Deletes one or more specified objects from the workspace.
  • rm(list=ls()): Removes all objects from the workspace. DANGEROUS COMMAND!
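
A minimal sketch of a typical workspace session. The object names are made up for the example, and tempdir() is used only because it is a folder that exists on every machine.

```r
# Print the current working directory
getwd()

# Move to another folder and come back
old_dir <- getwd()
setwd(tempdir())
setwd(old_dir)

# Create two objects, then inspect and clean the workspace
first_value  <- 1
second_value <- "text"
ls()               # includes "first_value" and "second_value"
rm(first_value)    # removes a single object
ls()               # "first_value" is gone, "second_value" is still there
```

rm(list = ls()) is deliberately left out of the example, since it would also delete every other object you have created in the session.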

Packages

  • library(): Lists the installed packages in the library specified by .libPaths().
  • Many R packages developed by users are stored on CRAN: Comprehensive R Archive Network.
  • install.packages("package_name"): Installs a package from CRAN.
  • library(package_name): Loads a package into the R session.
    • library(package_name) returns an error if the package is not installed in any of the library paths listed by .libPaths().
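
One possible way to load a package safely is sketched below. It uses stats4 only because that package comes with the base system; pkg_name is an illustrative variable, and requireNamespace() and the :: operator are base R tools not covered in the list above.

```r
# Packages currently attached to the session
search()

# Load a package only if it is installed
pkg_name <- "stats4"
if (requireNamespace(pkg_name, quietly = TRUE)) {
  library(pkg_name, character.only = TRUE)
} else {
  message("Package '", pkg_name, "' is not installed. ",
          "Try install.packages(\"", pkg_name, "\").")
}

# Call a single function without attaching its package, using ::
stats::median(c(1, 5, 3))   # 3
```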

Before we Begin

  • Case Sensitivity: R distinguishes between uppercase and lowercase letters (it is case sensitive).
  • Comments: Use # to indicate a comment. You can use CTRL + Shift + C in RStudio to comment or uncomment a block of code.
  • Importance of Commenting: It is VERY important to comment your code. Describe what each function does and why you chose to use it. This is useful for others who may read your code and, importantly, for yourself when you revisit your code after some time.
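
The points above can be sketched as follows. The income figures are invented for the example.

```r
# Case sensitivity: total and Total are two different objects
total <- 10
Total <- 20
total == Total   # FALSE

# A good comment explains why, not only what:
# the median is used because income data are typically skewed
incomes <- c(1200, 1500, 1800, 25000)
typical_income <- median(incomes)
typical_income   # 1650
```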

Before we Begin

<- is the symbol for assignment in R. When using RStudio, you can use the shortcut Alt + - to insert this assignment operator.
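
A short sketch of assignment in use. The names x, y, and z are arbitrary.

```r
x <- 5          # store the value 5 under the name x
x               # typing the name prints the value
y <- x * 2      # the right-hand side is evaluated first, then stored in y
(z <- y + 1)    # wrapping the assignment in parentheses also prints the result
```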

Before we Begin

“There are only two hard things in Computer Science: cache invalidation and naming things.”
— Phil Karlton

When deciding on names in R, follow these basic guidelines:

  1. Use only lowercase letters and numbers.
  2. Use underscores (_) to separate words in an object name (e.g., model_one). Do not use hyphens: R reads - as the subtraction operator, so model-one means "model minus one".
  3. Avoid names that might cause ambiguity.

For more details, refer to the Tidyverse Style Guide.
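
As an illustration of the guidelines, the sketch below shows two descriptive names and one ambiguous choice. All names and values are invented for the example.

```r
# Descriptive names: lowercase letters, numbers, underscores
n_students      <- 30
mean_score_2024 <- 27.5

# An ambiguous name: T is only a shortcut for TRUE and can be overwritten
T <- 0
isTRUE(T)   # FALSE: T no longer means TRUE
rm(T)       # remove our object; T refers to TRUE again
isTRUE(T)   # TRUE
```

Writing TRUE and FALSE in full avoids this kind of ambiguity altogether.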

Naming objects

Not all words can be used as object names in R. There are reserved words, such as for, while, and if, which are part of the language syntax and cannot be used as object names.

To view the complete list of reserved words, you can type the command:

?Reserved

Arithmetic Operators:

  • + : Addition
  • - : Subtraction
  • * : Multiplication
  • / : Division
  • ^ : Exponentiation
  • %% : Modulo (remainder of integer division)
  • %/% : Integer division (quotient)
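
A quick sketch of the arithmetic operators at the console, including the usual precedence rules. The numbers are arbitrary.

```r
7 + 2      # 9
7 - 2      # 5
7 * 2      # 14
7 / 2      # 3.5
7 ^ 2      # 49
7 %% 2     # 1, the remainder of 7 divided by 2
7 %/% 2    # 3, the integer quotient of 7 divided by 2

# Multiplication is evaluated before addition; use parentheses to change the order
2 + 3 * 4      # 14
(2 + 3) * 4    # 20
```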

Relational Operators (return TRUE or FALSE):

  • > : Greater than
  • < : Less than
  • >= : Greater than or equal to
  • <= : Less than or equal to
  • == : Equal to
  • != : Not equal to
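
The relational operators can be tried as follows. The last two lines are an aside on a common pitfall: decimal numbers are stored approximately, so all.equal() is usually safer than == when comparing them.

```r
x <- 5
x > 3       # TRUE
x <= 4      # FALSE
x == 5      # TRUE
x != 5      # FALSE

# Comparisons work element by element on vectors
c(1, 5, 10) >= 5    # FALSE TRUE TRUE

# Floating-point caution
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```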

Logical Operators:

  • ! : NOT
  • & : AND (element-wise)
  • | : OR (element-wise)
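
A minimal sketch of the logical operators, first on two small vectors and then combined with relational operators. The values, including age, are invented for the example.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

!a       # FALSE FALSE  TRUE
a & b    #  TRUE FALSE FALSE
a | b    #  TRUE  TRUE FALSE

# Logical operators are often used to combine comparisons
age <- 25
age >= 18 & age < 65    # TRUE
```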

Mathematical Functions:

  • log(x) : Natural logarithm of x
  • factorial(x) : Factorial of x
  • sqrt(x) : Square root of x
  • sin(x) : Sine of x (in radians)
  • cos(x) : Cosine of x (in radians)
  • tan(x) : Tangent of x (in radians)
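
The functions above can be tried with a few well-known values. The results shown in the comments are what R prints; some of them are exact only up to floating-point error.

```r
log(exp(1))            # 1, since log() is the natural logarithm
log(100, base = 10)    # 2, a different base can be requested explicitly
factorial(5)           # 120
sqrt(16)               # 4

# Trigonometric functions expect radians
sin(pi / 2)            # 1
cos(0)                 # 1
tan(pi / 4)            # 1

# Convert degrees to radians before calling them
sin(30 * pi / 180)     # 0.5
```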

R objects: five basic classes

  1. character: Represents text or strings.
  2. numeric: Represents numbers and is divided into two classes, which brings the total to five:
    • double: Double-precision floating-point numbers.
    • integer: Integer numbers.
  3. complex: Represents complex numbers with real and imaginary parts.
  4. logical: Represents Boolean values (TRUE or FALSE).

To find out how an object is stored, you can use the function typeof().
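
A short sketch of typeof() applied to one value of each kind. Note that a number typed without the L suffix is stored as a double, even when it has no decimal part.

```r
typeof("hello")    # "character"
typeof(3.14)       # "double"
typeof(3)          # "double"
typeof(3L)         # "integer"
typeof(2 + 1i)     # "complex"
typeof(TRUE)       # "logical"

# Both doubles and integers count as numeric
is.numeric(3.14)   # TRUE
is.numeric(3L)     # TRUE
```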

Data structures